Methods for the Extraction of Hungarian Multi-Word Lexemes

نویسندگان

  • Balázs Kis
  • Begoña Villada
  • Gosse Bouma
  • Gábor Ugray
  • Tamás Bíró
  • Gábor Pohl
  • John Nerbonne
چکیده

This paper describes an experiment on extracting Hungarian multi-word lexemes from a corpus, using statistical methods. Corpus preparation—the addition of POS tags and stems—was done automatically. From the corpus, 〈verb+noun+casemark〉 patterns were extracted as collocation candidates. Evaluation shows that the statistical methods used by Villada Moirón (2004a) to identify Dutch V + PP collocations, can also be applied to the Hungarian data. Some collocation types (such as verbal arguments) require special extraction methods, as explained in the evaluation section. Finally, we suggest that the extraction process can be further improved by a blend of statistical techniques with rule-based and dictionary-based methods.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A New Approach to the Corpus-based Statistical Investigation of Hungarian Multi-word Lexemes

We apply statistical methods to perform automatic extraction of Hungarian collocations from corpora. Due to the complexity of Hungarian morphology, a complex resource preparation tool chain has been developed. This tool chain implements a reusable and, in principle, language independent framework. In the first part, the paper describes the tool chain itself, then, in the second part, an experim...

متن کامل

Using local rules for disambiguation of homographs in Hungarian corpora

The historical corpus of Hungarian contains about 20 million running words at the moment. To be able to retrieve the occurrences of the lexemes, a morphological analyser programme was developed which is able to segment the running words and identifies the lexeme and the suffixes. Over 30% of the running words can have more then one correct analysis. Therefore we are aiming to develop methods fo...

متن کامل

CoCoCo: Online Extraction of Russian Multiword Expressions

In the CoCoCo project we develop methods to extract multi-word expressions of various kinds—idioms, multi-word lexemes, collocations, and colligations—and to evaluate their linguistic stability in a common, uniform fashion. In this paper we introduce a Web interface, which provides the user with access to these measures, to query Russian-language corpora. Potential users of these tools include ...

متن کامل

Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation

In this paper we present an experiment to automatically generate annotated training corpora for a supervised word sense disambiguation module operating in an English-Hungarian and a Hungarian-English machine translation system. Training examples for the WSD module are produced by annotating ambiguous lexical items in the source language (words having several possible translations) with their pr...

متن کامل

Idarex: Formal Description of Multi-word Lexemes with Regular Expressions

Most multi-word lexemes (MWLs) allow certain types of variation. This has to be taken into account for their description and their recognition in texts. We suggest to describe their syntactic restrictions and their idiosyncratic peculiarities with local grammar rules, which at the same time express in a general way regularities valid for a whole class of MWLs. The local grammars can be written ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003